1 Introduction


Background 

Laws and regulations mandate that Army installations monitor emissions from
industrial processes, and maintain their processes within emissions standards.
Failure to follow regulations may result in health and safety hazards to
installation employees and persons living in the surrounding areas, and may incur
heavy fines for the installations. Army installations that have industrial
operations commonly use pollution control equipment (PCE) to monitor emissions, and
to stay within regulatory and legal limits. PCE has become an integral part of
manufacturing systems.

Nevertheless, PCE is no “easy cure” for all problems of hazardous emissions.
PCE must be carefully selected and used only in those applications for which it
was designed. Users must take care that PCE is well matched when it is simply
“added-on” to an established piece of machinery. Also, PCE, like any other
complex machinery, requires maintenance for optimal performance. If PCE is not
carefully operated and maintained, emissions from manufacturing processes
may exceed limits and cause environmental hazards. Moreover, PCE
maintenance is expensive and labor intensive, and often is not a high priority item in a
manufacturing facility.

Before investing dollars in expensive sensors and maintenance programs,
manufacturing installations should look at other available options. Options include
using advanced technologies to detect problem conditions, collecting data to
predict and determine the cause of failures, using dynamic modeling techniques to
model the system or components that have a higher than expected frequency of
failure, and verifying the efficacy of the model with data collected from test runs.

An earlier CERL publication (Northrup et al., September 1998) discussed
problems of design and their flaws and the problems associated with manufacturers’
statistical analysis for failure mode of manufacturing. Other previous CERL
work (Chalifoux, Northrup, and Baird 1999; Chalifoux, Northrup, and Chan
1999) has shown that Reliability Centered Maintenance (RCM) usually depends
upon test regimens rather than approaching the subject from a statistical
viewpoint, even though statistics have been used in manufacturing quite successfully.
Statistical maintenance modeling, even when approached from different initial
viewpoints, reveals the impact of different maintenance policies. Until recently,
large-scale systems have escaped effective analysis. This study attempted to
model large systems with queuing theory, producing a model that can reduce
equipment downtime and help optimize maintenance policy for minimal
ecological impact.


Objectives 

The  objectives  of  the  project  were  to  investigate  dynamic  modeling  and  advanced 
maintenance  technologies  and  the  use  of  these  technologies  in  detecting  systemic 
problem  areas.  Another  objective  was  to  develop  a  dynamic  computer  model 
based  on  queuing  theory  and  using  off-the-shelf  software  to  predict  and  analyze 
failure  distribution  in  systems. 


Approach 

Advanced maintenance technologies and techniques for use in detecting problem
conditions in industrial systems were studied. These technologies and
techniques include dynamic modeling, acoustic emission methods, Failure Modes and
Effects Analysis (FMEA), spectrum and waveform analysis, ultrasound, infrared
thermography, and vibration monitoring. Reliability Theorem (RT) and its
application to RCM, including a newer approach to RCM from queuing analysis,
provided the basis for developing a computer model based on queuing theory to
apply to complex systems.


Mode  of  Technology  Transfer 

It is anticipated that the queuing theory computer model and documentation will
be available on the CERL web page at URL:


http://www.cecer.army.mil/ 


2 Dynamic Modeling: A Technique for
Operation  and  Maintenance  of  Pollution 
Control  Equipment 

Overview  of  Dynamic  Modeling 

Researchers use models to explain real-world situations. Models begin as
abstract ideas about reality. To implement a model, one must examine the
assumptions underlying the abstract ideas used to create the model. Models allow us to
explain and sometimes predict the outcomes of the structural and dynamic
assumptions that one makes in abstraction. Developing a model can be a
complicated procedure, but the process can be simplified by identifying a set of general
procedures. Figure 1 shows a simplified form of these general procedures.
Sometimes real events cause us to look at particulars of these events, and in
turn, these particular interests may be restated as a set of questions regarding
the events and what brought them about. By identifying key elements of
processes and observations, we can form an abstraction of the real events. These key
elements include both the variables that describe the events and the
relationships among the variables. Ultimately, both the variables and their
relationships establish the model’s structure. We can then use the model to formulate
conclusions and predict the outcome of future events. Comparing conclusions
and predictions to real events may reveal that a model is inaccurate,
acceptable, or needs revisions. Model building is a continuum of revisions,
comparisons, and changes that all lead to a better understanding of the reality in
question.

Models  may  represent  a  specific  phenomenon  at  a  single  point  in  time,  such  as 
the  location  and  size  of  a  city,  or  they  may  represent  rates  of  change  over  time, 
such as the rate of migration to or from a city. The latter type of model is a
“dynamic model.” The present study used the principles of dynamic modeling to
construct  computer  models  to  predict  PCE  maintenance.  Computer  models  help 
to  clarify  real-world  processes  because  computer  simulation  can  be  applied  to 
imitate  the  actual  forces  presumed  to  cause  a  system’s  behavior. 

Figure 1. The process of model construction (Hannon and Matthias 1994, p 4).


Initially, models should be kept simple. They may otherwise exceed the
complexity of the real-world system that they were meant to explain. Complexity
may be added later if the initial model does not reproduce the real effects. Models
are causal because they are built from general rules that demonstrate how each
element in a system responds to changes of other elements. A model is a device
that keeps us organized while gathering data and evaluating knowledge about
the mechanisms that lead to changes in a system.


Systems 

Before developing a model of a system, it is important to understand some
aspects of systems. Systems include elements called variables. These are further
described  as  state  and  control  variables.  State  variables  may  be  conserved  or 
nonconserved.  Conserved  state  variables  denote  an  accumulation  of  materials  or 
information,  such  as  population.  Nonconserved  state  variables  are  indicators  of 
some  part  of  a  system’s  condition.  Examples  of  nonconserved  state  variables  are 
price  and  temperature. 

The  elements  in  a  system  that  represent  changes  in  state  variables  are  control 
variables  or  flows,  and  they  are  responsible  for  updating  state  variables.  The 
“number  of  barrels  of  waste  extracted  per  period”  is  a  control  variable  as  it 
changes  the  state  variable  “reserves  of  waste.” 
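
The relationship between a state variable and a control variable can be sketched in a few lines of Python. This is an illustration only: the function name, the starting reserve of 100 barrels, and the extraction rate of 30 barrels per period are assumed values chosen for the example and do not come from the study.

```python
def simulate_reserves(initial_reserves, extraction_per_period, periods):
    """Update a conserved state variable (reserves) with a control variable (flow)."""
    reserves = initial_reserves          # state variable: "reserves of waste"
    history = [reserves]
    for _ in range(periods):
        # Control variable: barrels extracted this period. The flow cannot
        # remove more than the stock currently holds.
        outflow = min(extraction_per_period, reserves)
        reserves -= outflow              # the flow updates the state variable
        history.append(reserves)
    return history


# Hypothetical values: 100 barrels in reserve, 30 barrels extracted per period.
history = simulate_reserves(initial_reserves=100.0, extraction_per_period=30.0, periods=4)
print(history)
```

The state variable changes only through its flow, which is the property that makes it "conserved" in the sense used above.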

Another aspect of a system is the interaction of components within the system,
in the form of feedback, both positive and negative. Feedback results from
changes in a system component that cause changes in other components.
Ultimately, these latter changes affect the component that originally initiated
the change. If the series of changes strengthens the original process, then the
feedback is positive. On the other hand, if the original change is counteracted by
the series of changes, then the feedback is negative. Positive feedback processes
magnify disturbance and move the system away from equilibrium. Negative
feedback processes, by counteracting disturbances, lead a system toward steady
state.
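
The two kinds of feedback can be illustrated with a minimal numerical sketch. The linear rule dx/dt = gain * (x - target), the gain values, and the step size are assumptions made for the example; real systems generally have more complicated feedback paths.

```python
def simulate_feedback(gain, start, target=0.0, steps=20, dt=0.1):
    """Euler-integrate dx/dt = gain * (x - target).

    gain > 0: positive feedback, a deviation from the target is amplified.
    gain < 0: negative feedback, a deviation from the target is counteracted.
    """
    x = start
    path = [x]
    for _ in range(steps):
        x += gain * (x - target) * dt
        path.append(x)
    return path


# Same initial disturbance, opposite feedback signs (hypothetical gains).
positive = simulate_feedback(gain=0.5, start=1.0)
negative = simulate_feedback(gain=-0.5, start=1.0)
print(f"positive feedback: {positive[0]:.3f} -> {positive[-1]:.3f}")
print(f"negative feedback: {negative[0]:.3f} -> {negative[-1]:.3f}")
```

With the positive gain the disturbance grows at every step, and with the negative gain it shrinks toward the target, which here plays the role of the steady state.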


Model  Building 

According to Hannon and Matthias (1994, p 7), the model building process
includes a series of steps:

1. Define problem and goals. Carefully structure the questions regarding the
problem you require the model to answer. Decide whether the goals of the model are
to be descriptive or predictive.

2.  Designate  the  state  variables.  Keep  this  step  simple  and  denote  units  for  the 
variables. 

3.  Select  control  variables.  Choose  control  variables  and  corresponding  flow  controls 
into  and  out  of  the  state  variables.  Record  which  state  variables  are  donors  and 
which  are  recipients  in  relation  to  the  control  variables.  Also,  note  the  control 
variable  units.  At  this  step,  use  one  type  of  control  to  represent  a  class  of  similar 
controls. 

4. Select parameters for the control variables. When selecting parameters for
control variables, ensure that you know to which function the controls and their
parameters relate. Note the units for parameters.

5. Check the resulting model. Check the resulting model for violations of laws,
continuity requirements, and consistency of units.

6.  See  how  the  model  will  work.  Choose  the  following:  a  time  horizon,  which  you 
will  use  to  look  at  the  dynamic  behavior  of  the  model,  the  duration  of  each  time 
interval  for  the  updating  of  state  variables,  and  the  procedure  for  calculating 
flows.  Using  a  graph,  estimate  the  variation  of  the  state  variable  curves. 

7.  Run  the  model.  Choose  different  lengths  for  each  time  interval  and  alternate  the 
integration  procedures  to  see  if  the  results  are  the  same. 

8. Vary the parameters. Vary the parameters to make sure the graph still makes
sense. Revise the model to correct errors and irregularities.

9. Compare the results to experimental data. To do this, you may have to close off
sections of the model so that you can simulate a laboratory experiment.

10. Revise the parameters. Revise parameters to include exceptions to the
experimental results and to increase the complexity of the model, if necessary.
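
Steps 6 and 7 can be illustrated with a small sketch that runs the same one-stock model at two time-interval lengths and compares the results. The model (a stock with a constant inflow and an outflow proportional to the stock), the parameter values, and the use of simple Euler integration are assumptions made for the example. The sketch varies only the time interval, not the integration procedure.

```python
def run_model(dt, horizon=50.0, inflow=5.0, removal_rate=0.2, initial_stock=0.0):
    """Euler-integrate d(stock)/dt = inflow - removal_rate * stock up to the horizon."""
    stock = initial_stock                  # state variable
    steps = round(horizon / dt)            # number of update intervals
    for _ in range(steps):
        outflow = removal_rate * stock     # control variable (flow out of the stock)
        stock += (inflow - outflow) * dt   # update the state variable
    return stock


coarse = run_model(dt=1.0)      # long time interval
fine = run_model(dt=0.1)        # short time interval
steady_state = 5.0 / 0.2        # analytic steady state: inflow / removal_rate

print(f"dt=1.0: {coarse:.4f}  dt=0.1: {fine:.4f}  steady state: {steady_state:.4f}")
```

If the two runs disagree noticeably, the time interval is too coarse for the dynamics being modeled and should be shortened before the results are interpreted.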


Modeling  Nonlinear  Relationships 

Linearity refers to lines, planes, and (flat) three-dimensional space, and these
objects always appear the same from any aspect. A nonlinear object such as a
sphere appears different on different scales. When it is viewed up close, it
appears as a plane, whereas from a distance it looks like a point. Nonlinear
relationships occur when a control variable does not depend linearly on other
variables, but for example, varies with the square root of another variable.

Nonlinearities are especially important in developing models, as many real
systems are ruled by nonlinearities. Usually, nonlinear systems do not have specific
mathematical solutions and often include characteristics that were not expected
or that were incorrectly identified. These unexpected characteristics include
chaos. In mathematical terms, chaos is unpredictable long-term behavior that
arises in a deterministic dynamical system, due to sensitivity to initial
conditions. A dynamical system is one that has a state space, whose coordinates
describe its dynamical state at any instant of time. A dynamical system also
possesses a dynamical rule that specifies the immediate future trend of all state
variables from the present values of the same state variables. Dynamical systems
may be deterministic or stochastic. Most nonlinear science deals with
deterministic systems. A dynamical system is deterministic if a unique resultant to each
state exists, and a dynamical system is stochastic if more than one resultant,
selected from a probability distribution, exists. Dynamical systems can also have
discrete or continuous time.
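
Sensitivity to initial conditions in a deterministic system can be demonstrated with a short sketch. The logistic map x[n+1] = r * x[n] * (1 - x[n]) with r = 4 is a standard textbook example chosen here for illustration; it is an assumption of this sketch and is not a model used in the study.

```python
def logistic_trajectory(x0, r=4.0, steps=60):
    """Iterate the deterministic rule x[n+1] = r * x[n] * (1 - x[n])."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs


# Two starting states that differ by one part in ten billion.
a = logistic_trajectory(0.2)
b = logistic_trajectory(0.2 + 1e-10)
gap = [abs(p - q) for p, q in zip(a, b)]

print(f"gap after 5 steps:  {gap[5]:.2e}")
print(f"largest gap in run: {max(gap):.2e}")
```

The rule is fully deterministic (the same start always gives the same trajectory), yet the tiny initial difference grows until the two trajectories are unrelated, which is why long-term prediction fails.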

In discrete event models, there are events and specific time intervals between
the events. The occurrence of events drives the model in discrete event models.
Computer models of physical systems essentially are discrete approximations
where a series of discrete events represents changes in system state. In
continuous models, on the other hand, time moves forward at regular intervals, and
there is a direct relationship between processes and time. Deterministic
differential equations and algebraic equations are required to describe continuous
simulation models.
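
A minimal sketch of an event-driven model follows. Equipment failures arrive at random and a single repair crew services them in order. The failure rate, repair time, time horizon, and single-crew assumption are hypothetical values chosen for the illustration and are not parameters from the study's queuing model.

```python
import random


def simulate_repairs(failure_rate, repair_time, horizon, seed=0):
    """Event-driven sketch: the clock jumps from one failure event to the next.

    Returns the number of failures handled and the total time that failed
    equipment spent waiting for the single repair crew.
    """
    rng = random.Random(seed)      # seeded so that a run is repeatable
    clock = 0.0
    crew_free_at = 0.0
    failures = 0
    total_wait = 0.0
    while True:
        # Time to the next failure is drawn from an exponential distribution.
        clock += rng.expovariate(failure_rate)
        if clock > horizon:
            break
        start = max(clock, crew_free_at)   # wait if the crew is still busy
        total_wait += start - clock
        crew_free_at = start + repair_time
        failures += 1
    return failures, total_wait


failures, total_wait = simulate_repairs(failure_rate=0.5, repair_time=1.0, horizon=100.0)
print(f"{failures} failures, {total_wait:.1f} time units spent waiting for repair")
```

Nothing happens between events, so the simulation does no work there. A continuous model would instead update its state at every regular time step.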


3 Advanced Technologies for Operation
and  Maintenance  of  Pollution  Control 
Equipment 


The researchers for this project studied advanced technologies available on the
commercial market. Although many of these individual technologies are
applicable to solve operations and maintenance problems at industrial installations,
due to constraints of both budgets and personnel in the military we did not
propose their purchase and use. Instead we used the knowledge of the technologies
only as a basis to advance our research of cyclic problems in undefined failure
modes. The following examples, therefore, are included as reference material for
those who may want to follow up on our investigations.


Distribution Failure Prediction System

This state-of-the-art system is designed to detect failure symptoms and incipient
failures in distribution feeders. It allows the user to perform diagnosis and
condition monitoring in distribution lines. Preventing distribution failures
improves power quality and helps achieve optimal equipment operation. Figure 2
shows a block diagram of the distribution failure prediction system.


Figure 2. Block diagram of distribution failure prediction system
(adapted from System Block Diagram,
http://www.kevin.co.kr/eng/Product/1/d_7.htm).


CERL  TR  99/88 


Features  of  the  system  include: 

• an expert system that uses an artificial intelligence language

• an ability to predict insulator failure

•  intelligent  decisionmaking  using  frequency  parameters. 


Object-Oriented  Cognitive  Decision  Support  Engine 

This system searches for data and returns the sources that are of particular
interest to the decisionmakers’ problem. The system uses a “smart scout” to track
and identify data from static and dynamic databases, real-time instruments,
images (visual, radar, satellite), and direct input. It then notifies the
decisionmaker of changes in the data that affect the decision.

This  engine  includes  the  following  capabilities: 

• It manages the cognitive process and data mining operations through
intelligent agents.

• It develops data “scouts” that look at data sources and report changes.

•  It  improves  decisions  based  on  past  performance. 

• It allows changes in information to be reflected immediately in decisions.

• It consolidates information from multiple distributed sources for human
action.

Source:  http://www.nasec.ctc.com/manuknow/technica.htm 


Failure  Mode  and  Effect  Analysis  (FMEA) 

This  is  a  technique  used  to  identify  and  eliminate  known  or  potential  problems 
from  a  system.  FMEA  should  be  integrated  into  the  initial  design  review,  and  it 
should  be  an  ongoing  process  throughout  the  life  of  the  product. 

Failure mode is a function of a part number. In a given system, each component
part number is analyzed to ascertain possible failure modes, for example, open,
short, mechanical failure, etc. Theoretically, each part has limitless potential
failure modes, but in reality there is a point of diminishing returns where the
cost added exceeds the derived benefits. Failure modes that have the same effect
may be combined and separated later if necessary. Initially, FMEA should
include all system components that would be repaired or replaced during a
maintenance activity, and other failure modes may be added as failures occur.

The effect of a part failure depends on how the part functions in the system. Even
though two valves have the same part number, the effect of a failure depends on
what each valve is governing. It is therefore imperative that each component in
a system have a unique symbol independent of the part number.

The relative importance of a failure mode is denoted by its risk priority number
(RPN), which is calculated from the formula:

$\mathrm{RPN} = S \times O \times D$     (Eq. 1)


where: 

S = the severity of a potential failure, which is assigned a value from 1 to
10, where 10 is the most severe failure.

O = the occurrence of the failure (Relative Failure Rate), which is assigned
a value from 1 to 10, where 10 is the highest failure rate.

D = the ability to detect a failure, which is assigned a value from 1 to 10,
where 10 is the most difficult to detect.
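
As a worked illustration, the sketch below computes the RPN as the product of the three ratings and ranks a few failure modes by it. The failure mode names and their S, O, and D ratings are hypothetical values invented for the example; they are not data from the study.

```python
def risk_priority_number(severity, occurrence, detection):
    """Return RPN = S * O * D, with each rating on the 1 to 10 scale."""
    ratings = {"severity": severity, "occurrence": occurrence, "detection": detection}
    for name, value in ratings.items():
        if not 1 <= value <= 10:
            raise ValueError(f"{name} must be between 1 and 10, got {value}")
    return severity * occurrence * detection


# Hypothetical failure modes with (S, O, D) ratings.
failure_modes = {
    "valve stuck open": (7, 3, 4),
    "fan bearing seizure": (8, 2, 6),
    "sensor drift": (4, 6, 8),
}

# Rank the failure modes from highest to lowest RPN.
ranked = sorted(
    failure_modes,
    key=lambda mode: risk_priority_number(*failure_modes[mode]),
    reverse=True,
)
for mode in ranked:
    print(f"{mode}: RPN = {risk_priority_number(*failure_modes[mode])}")
```

In this made-up example the least severe failure mode ranks highest because it occurs often and is hard to detect, which shows how the three ratings trade off against one another.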

The fact that FMEA has not been widely used is due, according to discussions
between engineers and researchers Jeong and Iizuka, to the following:

1. FMEA is a time-consuming technique that gives unsatisfactory results.

2. The prediction of failure mode depends too much on a predictor’s experience and
organizational information. Hence, failure mode omission will result.

3. Evaluation of the seriousness of the failure mode is difficult.

4. The experience gained from an FMEA is difficult to reuse.

In response to the above criticisms of FMEA, Jeong and Iizuka proposed a
technique to prevent these difficulties. First, they investigated
problems associated with failure mode prediction and suggested a method to
predict failure modes effectively. Jeong and Iizuka proposed three approaches to
effectively predicting failure modes with less omission: (1) failure mode
prediction based on “association” (Yeong and Iizuka 1996), (2) preparation of Failure
Mode Mechanism (FMM) diagram (Yeong and Iizuka 1997), and (3) failure mode
prediction based on “hierarchy” (Yeong and Iizuka 1996). Second, they presented
a method to analyze the cause and effect of failure modes effectively. Third, they
applied these methods to refine the above proposals, to ensure the effective
application of FMEA.


Equipment Monitoring

Tracer, whose customers include the U.S. Navy, the National Aeronautics and
Space Administration (NASA), ARPA, and several oil companies, defines six
different monitoring functions as follows: (1) continuous protection against
catastrophic failure, (2) early detection of machine abnormalities, (3) accurate
diagnosis of problem, (4) assessment of level severity, (5) accurate prediction of future
machine condition versus time (including time to failure), and (6) generation of
feedback information for control of machine operational characteristics. Each of
these functions, although requiring different monitoring system design, begins
with fault mode selection and data collection.


Rotating  Machinery  Simulator 

This  technology  is  designed  as  an  educational  tool  for  the  study  of  vibration  due 
to  rotating  machinery.  It  allows  an  operator  to  learn  the  principles  of  rotating 
machinery  in  a  controlled  environment.  The  main  features  of  a  commercially 
available  rotating  machinery  vibration  training  tool  are: 

•  rotating  machinery  vibration  simulation 

•  multi-plane  balancing 

•  dynamically  induced  structural  vibration 

•  sixteen  weight  positions  per  plane 

•  two  100  mV/g  accelerometers 

• magnetic tachometer for 1/rev signal

• 50 piece balance weight kit with hex keys

• optional configuration settings for popular balancing equipment

Source: http://altasol.com/rms01.htm


Predictive  Diagnosis  for  Rolling  Bearings 

Although  many  rolling  bearings  are  used  in  a  mechanical  plant,  failure  of  just 
one bearing can result in a total shutdown. Acoustic emission (AE) methods
have shown great promise in the successful prediction of fatigue in rolling
bearings. Analysis of the AE signal also yields information regarding fatigue crack
propagation.


Source: http://www.mel.go.jp/mainlab/kiso/kiso1e.html


Infrared Thermography

Infrared thermography is a technology that is used to survey machines and
structures to detect problems. Operators employ portable infrared cameras to
convert thermal energy into high-resolution images for quantitative temperature
analysis. The images are collected in minutes and problems are immediately
identified. The images can be stored on a computer and used for trending in
ensuing surveys. Applications of infrared thermography include the following:

•  Electrical  and  mechanical  maintenance. 

• Easy detection of overheating of bearings, switchgear, transformers,
busbars, overhead power lines and substations.

•  Quick  recognition  of  faulty  components  in  electrical  equipment. 

•  In  metal  refining,  smelting  or  sintering  processes,  refractory  wear  can  be 
identified  in  pots,  kilns,  furnaces,  ladles,  and  torpedo  cars. 

•  In  buildings  and  cold  storage  units,  insulation  of  boilers,  pipework  and 
steamtraps,  the  integrity  of  cladding  can  be  easily  monitored. 

Source:  http://www.ozemail.com.au/~its3d/thermog.html 


Weibull  Analysis 

The Weibull distribution analysis can be used to predict failure rates as well as
to describe the failure of parts and equipment. The Weibull analysis provides
information on:

• characteristic life

• standard deviation of life

• mean life

• reliability functions

• reliable life

• median life

• initial failure rate per unit time.

Source:  http://www.bassengineering.com/weibull.htm 
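
The quantities listed above can be computed from the standard two-parameter Weibull distribution, as in the sketch below. The two-parameter form, the function names, and the example values (shape beta = 2.0, characteristic life eta = 1,000 hours) are assumptions made for illustration. In practice the parameters would be estimated from failure data.

```python
import math


def reliability(t, beta, eta):
    """Probability of surviving to time t: R(t) = exp(-(t / eta) ** beta)."""
    return math.exp(-((t / eta) ** beta))


def failure_rate(t, beta, eta):
    """Hazard rate per unit time: h(t) = (beta / eta) * (t / eta) ** (beta - 1)."""
    return (beta / eta) * (t / eta) ** (beta - 1.0)


def mean_life(beta, eta):
    """Mean life: eta * Gamma(1 + 1 / beta)."""
    return eta * math.gamma(1.0 + 1.0 / beta)


def std_dev_life(beta, eta):
    """Standard deviation of life."""
    g1 = math.gamma(1.0 + 1.0 / beta)
    g2 = math.gamma(1.0 + 2.0 / beta)
    return eta * math.sqrt(g2 - g1 ** 2)


def reliable_life(target_reliability, beta, eta):
    """Time by which reliability has fallen to the target value."""
    return eta * (-math.log(target_reliability)) ** (1.0 / beta)


def median_life(beta, eta):
    """Time by which half of the units are expected to have failed."""
    return reliable_life(0.5, beta, eta)


# Hypothetical parameters: shape beta and characteristic life eta (hours).
beta, eta = 2.0, 1000.0
print(f"mean life:          {mean_life(beta, eta):.1f} h")
print(f"standard deviation: {std_dev_life(beta, eta):.1f} h")
print(f"median life:        {median_life(beta, eta):.1f} h")
print(f"90% reliable life:  {reliable_life(0.9, beta, eta):.1f} h")
print(f"R(eta):             {reliability(eta, beta, eta):.3f}")
```

The characteristic life eta is the time at which reliability has fallen to exp(-1), about 36.8 percent, whatever the value of beta.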


Fault  Tree  Analysis  (FTA) 

FTA is an analytical technique that tries to combine all of the factors that affect
the success or failure of a product, process, or mission into a single FTA Logic
Diagram. A single FTA Logic Diagram uses symbols called “Logic Gates,” which
are similar to the symbols used by electronic circuit designers. The FTA Logic
Diagram proves to be a sound method to define the relationships between the
hardware, software, and human components of a system.

The  inputs  to  a  Logic  Gate  (symbol)  depict  the  status  of  a  part  and/or  other  factor 
that  is  being  included  in  the  analysis.  The  output  from  a  Logic  Gate  (symbol)  is 
a  logic  state  that  represents  a  condition  existing  in  a  system.  When  the  output 
from  a  Logic  Gate  changes,  an  event  occurs. 

The  state  is  TRUE  if  a  part  or  other  factor  is  functioning  correctly.  If  the  logic 
statement  is  TRUE,  we  assign  to  it  a  Boolean  logic  value  of  one  (1).  On  the  other 
hand,  the  state  is  FALSE  if  the  part  or  other  factor  is  malfunctioning.  In  this 
case,  we  assign  to  it  a  Boolean  logic  value  of  zero  (0). 

A Fault Tree Analysis is actually performed by determining what occurs in a
system when the status of a part or other factor changes. There is a minimum
criterion for success, which is that one single failure cannot cause injury or an
undetected loss of control over the process. In the case where extreme hazards
exist or during the processing of a highly valued product, the criterion may be
augmented to require toleration of multiple failures.

An FTA considers both positive and negative events. Logic tree segments that
lead to a negative event, an accident, for example, define all of the elements that
could go wrong to cause the negative event. The logic tree segments for negative
events are apt to use more OR gates than AND gates, with the exception of
redundant safeguards. Logic tree segments that lead to a positive event define
everything that works together for the machine to operate. Logic trees for
positive events, for example maintenance troubleshooting trees, in general use more
AND gates than OR gates, with the exception of redundancy.

NAND and NOR gates primarily define countermeasures that, if true, allow the
system to tolerate conditions that ordinarily result in safety hazards or machine
failure. For more information on Boolean or logic functions and logic gates,
see the sources listed below.

Sources: 

http://www.bassengineering.com/FTA.htm 

http://gatsby.lit.tas.edu.au/tibs/mprinc/logicb.html
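
A small logic tree for a positive event can be evaluated directly with Boolean operators, as sketched below. The four-component system (power supply, controller, and two redundant pumps) is hypothetical and was invented for this illustration. A status of True means the part is functioning correctly, following the convention described above.

```python
COMPONENTS = ("power", "controller", "pump_a", "pump_b")


def machine_operates(status):
    """Positive top event: True when everything needed to run is working."""
    # OR gate: the two pumps are redundant, so either one is sufficient.
    pumping = status["pump_a"] or status["pump_b"]
    # AND gate: power, controller, and pumping must all be available.
    return bool(status["power"] and status["controller"] and pumping)


def single_point_failures():
    """List the components whose failure alone stops the machine."""
    critical = []
    for failed in COMPONENTS:
        # Every part works (True) except the one under test (False).
        status = {name: name != failed for name in COMPONENTS}
        if not machine_operates(status):
            critical.append(failed)
    return critical


print(single_point_failures())
```

Checking each part in turn is a direct way to test the minimum criterion for success. Any component the function returns violates the single-failure criterion unless a safeguard is added, while the redundant pumps do not appear because the OR gate tolerates the loss of either one.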


Power Loss of High Speed Spur Gears

Power loss of high speed gears in industrial machinery leads to elevation of the
temperature of the gears and their lubricant. Research showed the relationship
between power loss and the mechanism of heat occurrence in an effort to create a
gear transmission system of higher efficiency. Researchers measured the rise in
temperature of the lubricant and gears and converted it into power losses of the
gears. They analyzed the sources and characteristics of power loss by changing
gear speed, tooth load, oil flow rate, and other running conditions of gears. The
researchers correlated characteristics of power loss sources to the gear tooth
form and other gear design parameters to reduce power loss.

Source: http://www.mel.go.jp/mainlab/kiso/kiso2e.html


Artificial  Neural  Network  (ANN)  and  Condition  Monitoring 

In an ideal world, condition monitoring of a complex electromechanical plant
requires skilled personnel with knowledge of these systems. However, with the
increasing depletion of such resources, ANN may offer a suitable alternative.

Neural  networks  use  a  set  of  processing  elements  or  nodes  that  are  analogous  to 
neurons  in  the  brain.  The  elements  are  interconnected  in  a  network  that  has  the 
ability  to  identify  patterns  in  data  as  the  network  is  exposed  to  the  data.  Figure 
3  shows  a  schematic  of  a  neural  network. 


Figure 3. The structure of a neural network
(http://www.zsolutions.com/light.htm).

In Figure 3, the bottom layer corresponds to the input layer, with 5 inputs X1
through X5. The middle layer, which performs most of the work, is known as the
“hidden layer” and contains a variable number of nodes. In this figure, the
output layer has two nodes, Z1 and Z2, that represent the output values to be
determined from the input values. Each node in the hidden layer is connected to
the inputs. What is learned in the hidden layer is based on all of the inputs
together. In this layer, the network learns interdependencies in the model. Figure
4 shows what happens inside a hidden node.

A simplified explanation of Figure 4 is that a weighted sum is performed within
the node: X1 times W1 plus X2 times W2 through X5 and W5. Furthermore, for
each hidden node and each output node, a weighted sum is performed. This
represents how interactions occur in the network.
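
The weighted sum can be sketched for a network with 5 inputs and 2 outputs, matching the layout described for Figure 3. The choice of three hidden nodes, the weight values, and the sigmoid squashing function applied after each weighted sum are assumptions made for this illustration. The source describes only the weighted sum itself.

```python
import math


def weighted_sum(inputs, weights):
    """X1 * W1 + X2 * W2 + ... for one node."""
    return sum(x * w for x, w in zip(inputs, weights))


def sigmoid(value):
    """Squash a weighted sum into the range (0, 1); an assumed activation."""
    return 1.0 / (1.0 + math.exp(-value))


def layer(inputs, weight_rows):
    """Compute every node in a layer; each row holds one node's weights."""
    return [sigmoid(weighted_sum(inputs, row)) for row in weight_rows]


# Hypothetical inputs X1..X5 and weights (three hidden nodes, two outputs).
x = [0.5, -1.0, 0.25, 0.0, 2.0]
hidden_weights = [
    [0.2, -0.1, 0.4, 0.3, 0.1],
    [-0.3, 0.2, 0.1, -0.2, 0.05],
    [0.1, 0.1, 0.1, 0.1, 0.1],
]
output_weights = [
    [0.5, -0.4, 0.3],
    [-0.2, 0.6, 0.1],
]

hidden = layer(x, hidden_weights)     # every hidden node sees all five inputs
z = layer(hidden, output_weights)     # Z1 and Z2 are built from the hidden nodes
print("hidden layer:", [round(h, 3) for h in hidden])
print("outputs Z1, Z2:", [round(v, 3) for v in z])
```

Training a network means adjusting the weights until the outputs match known data. This sketch shows only the forward calculation with fixed weights.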

Although a closed mathematical theory for linear time-invariant systems exists,
nonlinear systems lack an overall theory. To combat this, nonlinear
systems are sometimes linearized around their operating points, and linear
methods are then applied. The design of “universal” modules and structures for nonlinear
systems, which can be used for identification, prediction, and control, remains an
important issue. However, ANNs can learn nonlinear relationships relatively
easily if sufficient measured data and computing power are available. The
learning ability of ANNs may help overcome the difficult mathematical analysis
required to solve system identification and control problems in complex and
highly nonlinear systems.

Source: http://www.zsolutions.com/light.htm


(http://www.zsolutions.com/light.htm)


4 Case Studies Using Advanced
Technologies for Operation and
Maintenance of Pollution Control
Equipment


Vibration Interpretation Using Simulation and the Intelligence of
Networks (VISION) Research Project

The  VISION  Project  is  a  collaborative  industrial  research  project  under  the  Brite 
EuRam  III  initiative.  The  Project  commenced  on  1  May  1996,  and  was  scheduled 
to run for 3 years. There are nine partners in four countries engaged in the
Project, which has a total budget in excess of 3 MECU.

The objective of this project was to develop an intelligent, adaptive
monitoring and diagnostic system. The system, based on artificial intelligence
and simulation modules, analyzes vibration spectra data to sustain high-level
equipment reliability. The artificial intelligence derives from the integration
of neural networks and knowledge-based systems. The identified aims of VISION
are:

•  to  develop  a  first  level  physical  model  (test  rig)  to  represent  a  class  of  simple 
rotating  machines  (a  rotor  suspended  between  two  bearings) 

•  to  develop  a  finite  element  model  to  mimic  the  vibration  signals  generated  by 
the  test  rig  in  both  no  defect  and  defect  modes 

•  to verify the accuracy of the finite element model by comparing its output to
actual data from the test rig under controlled experimental conditions and
from actual plant data

•  to develop an intelligent software module using neural networks and
knowledge-based systems to optimize the parameters of the finite element model
that will bring its output close to real data generated by the test rig or
equivalent simple machines

•  to  develop  a  diagnostic  system  to  determine  the  operating  state  of  the  test  rig 
and  subsequently  the  plant  of  the  end-user,  from  observed  real  data 

•  to  develop  a  more  complex  second  level  physical  model  (test  rig),  by  adding 
couplings  and  additional  rotors  and  bearings 
•  to expand the finite element model to imitate the vibration signals
generated by the complex test rig

•  to compare the output from the expanded finite element model to actual data
from the second level test rig and actual plant data from the end-user sites

•  to expand the intelligent optimization software module to accommodate a
greater number of model parameters to represent more complex physical
systems.

Source:  http://157.228.102.29/vis-info/vis  home.htm 


Eli  Lilly  Study 

The  goals  of  the  Eli  Lilly  study,  conducted  between  February  and  October  1997, 
were: 

•  to find the most efficient approach to identify misalignment problems in
flexible coupling systems

•  to  isolate  the  source  of  heat  energy  in  a  coupling 

•  to  identify  different  approaches  to  problem  identification 

•  to identify problems that are associated with over- and under-tension of
belt-driven mechanical systems and the implications of over-lubrication in
bearings

•  to quantify over-consumption of power in a misaligned system.

The procedure included setting up an apparatus on which various types of
flexible couplings were mounted between a 10 horsepower drive motor and a
driven shaft that was adjustable to provide controlled misalignment. The study
employed a Fixturlaser Shaft 100 laser alignment system with one-micron
resolution to control the positioning of the shafts, as precise positioning of
the apparatus was a key to the success of the study. Researchers recorded the
following observations:

•  motor  current  signature 

•  motor  temperature  via  thermocouples 

•  motor and coupling temperature via infrared thermographic imaging

•  motor  bearing  vibration  spectra 

•  coupling airborne ultrasound spectra

•  bearing contact ultrasound spectra

•  load  cell  output. 
The study provided the following general conclusions and recommendations:

•  In  most  cases  axial  vibration  exceeded  radial  vibration.  This  confirms  the 
rule  that  misalignment  may  be  a  cause  when  axial  vibration  is  as  great  as  50 
percent  of  the  radial  vibration. 

•  Sometimes  misalignment  causes  high  vibration  at  2x  rpm,  indicating  that 
response  characteristics  depend  on  coupling  design  and  speed. 

•  Coupling design impacts the amplitude of the vibration when various
misalignment conditions are present.

•  Misalignment  diagnosed  solely  from  spectral  data  should  be  verified  using 
phase  data  and  ancillary  technologies. 

Further study is warranted to evaluate the effects of bearing condition on
results.
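
The rule of thumb quoted in the first conclusion (axial vibration at least 50 percent of radial vibration) can be written as a simple screening check. The sketch below is only an illustration; the function names and the example amplitudes are hypothetical, and, as the study recommends, a positive result should be verified with phase data and other technologies.

```python
def axial_to_radial_ratio(axial, radial):
    """Ratio of axial to radial vibration amplitude (same units for both)."""
    if radial <= 0:
        raise ValueError("radial amplitude must be positive")
    return axial / radial


def misalignment_suspected(axial, radial, threshold=0.5):
    """True when axial vibration is at least `threshold` times the radial."""
    return axial_to_radial_ratio(axial, radial) >= threshold


# Hypothetical amplitudes in arbitrary but consistent units.
flag_high = misalignment_suspected(axial=0.6, radial=1.0)
flag_low = misalignment_suspected(axial=0.2, radial=1.0)
```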

Although the results were mixed, the study supplemented the knowledge bank of
information on rotating equipment and the application of advanced maintenance
technologies to detect problem situations. The study researchers advocate a
“large toolbox approach” to determine problems in systems or system
components. The “large toolbox approach” involves the consideration and
possible application of a variety of technologies to detect problems in
mechanical and electrical systems.

Subsequent  to  the  study,  Eli  Lilly  suggested  the  following  approach: 

1.  Find  problems  or  potential  problems  quickly. 

2.  Prioritize  repairs. 

3.  Make  corrections  as  needed. 

Using infrared as a screening tool and applying the practices mentioned above,
Eli Lilly has been able to inspect and rate or prioritize three times as much
equipment for repair as before instituting this approach (Kelch 1998).


General  Electric  (GE)  Study  of  Primary  Coiling  Bearing  Run-In 
Equipment 

The objective of the study that commenced in May 1998 was to design a control
system for the primary coiling spindle run-in equipment. The following
subsystems were included in the control system: spindle speed display, spindle
speed control, bearing temperature monitor/feedback control, and a PC user
interface with data logging capabilities. To ensure full life from bearings, it
is necessary to run them in before placing them in service. The run-in system
detects manufacturing defects in the bearings. Four sets of bearings support
the primary coil, and although the bearings are rated for 18 months of
operation, they are failing in 6 months. The excessive failure rate may be due
to an inconsistent run-in process. Subsequently, General Electric plans to use
the run-in data to predict failure and to determine the cause of premature
failures.

The run-in process operates by varying the shaft speed from 2,000 RPM to
30,000 RPM over a 10-hour period. The three methods for varying the shaft
speed over this period are:

1.  To  run  the  shaft  at  maximum  speed  for  a  time,  then  stop  the  shaft  and  allow  the 
bearings  to  cool,  and  then  repeat  the  process 

2.  To  continually  increase  the  shaft  speed  over  the  entire  time  period 

3.  To  increase  the  speed  in  steps  over  the  run-in  period. 

Since  the  step  method  appears  to  cause  some  stress  on  bearings,  GE  researchers 
ran  tests  to  study  step-size  and  step-length. 

To ascertain the minimum step-size that should be used, researchers
investigated the dynamic system, based on the following three characteristics:

•  The  possibility  of  speed  instability  at  the  start  of  the  process 

•  The  effects  of  vibration  at  resonant  and  harmonic  frequencies 

•  Speed  surges  occasioned  by  motor  defects. 

Investigators  studied  each  of  the  above  three  characteristics  and  suggested  some 
guidelines  for  determining  a  suitable  step-size: 

•  Instability  could  occur  with  small  step  changes.  Because  one  measures 
speed  by  counting  the  number  of  revolutions  for  a  fixed  time  period,  less  error 
occurs  in  measuring  speed  at  higher  speeds. 

•  Vibration effects were investigated. Researchers first established a
resonant frequency and plan to make accurate measurements to determine
consistent natural frequencies. Factors that affect natural frequencies and
that need to be verified include: the bore diameter of the shaft collar, the
bearing quality, and accelerometer testing over the range of operation on
spindle assemblies.

•  Motor  wear  affects  the  determination  of  step-size.  Researchers  found  that  if 
step  size  is  not  greater  than  1,500  RPM,  then  surging  of  the  starting  torque 
may  affect  the  settling  time  of  the  step. 
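
The first guideline above rests on a counting argument: if speed is measured by counting whole revolutions in a fixed time window, the count can be off by about one revolution, so the relative error shrinks as speed rises. The Python sketch below illustrates this under stated assumptions: one count per revolution and a hypothetical 0.1-second counting window, neither of which is given in the source.

```python
def speed_resolution_rpm(gate_time_s):
    """Speed change that corresponds to one counted revolution per window."""
    return 60.0 / gate_time_s


def worst_case_relative_error(speed_rpm, gate_time_s):
    """One-count error expressed as a fraction of the true speed."""
    return speed_resolution_rpm(gate_time_s) / speed_rpm


GATE_TIME_S = 0.1  # hypothetical counting window, in seconds

low_speed_error = worst_case_relative_error(2000.0, GATE_TIME_S)
high_speed_error = worst_case_relative_error(30000.0, GATE_TIME_S)
```

With these assumed numbers, a step that is small compared with the speed resolution cannot be measured reliably, which is consistent with the observation that small step changes can lead to instability.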
The dynamics of the thermal system determine the minimum step-length.
Researchers used two sets of sensors to test the temperature sensors, and
K-type thermocouples to verify the temperature sensor readings. Data were
collected during the 78-minute test.

Investigators completed initial modeling of the run-in process and ran an
8-hour test to verify the initial modeling of the system. The test data and the
initial system model were nearly identical.

The  existing  run-in  system  is  an  open  loop  system.  Consequently,  significant 
transients  and  variations  of  speed  occur  during  the  run-in  process.  To  remedy 
this,  it  was  decided  to  convert  the  open  loop  system  to  a  closed  loop  one.  The 
first  step  was  to  analyze  and  model  the  open  loop  system.  The  method  used  was 
to estimate the model from input/output data. Investigators required two
parameters from the experiment, the DC gain and the motor time constant. When
these  data  were  obtained  from  the  test-run,  they  were  plotted  to  obtain  the 
steady-state  response  curve.  To  determine  the  transient  response  of  the  plant, 
the  steady-state  relationship  was  used  to  generate  three  input  signals  at  three 
operating  points.  Researchers  plotted  the  data  obtained,  overlaid  with  the  input 
signal.  Inspection  of  the  graphs  showed  that  the  system  could  be  modeled  with  a 
first  order  system.  A  first  order  model  that  provided  a  good  approximation  of  the 
plant  was  produced,  using  a  least  squares  error  modeling  technique. 
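
A least-squares fit of a first-order model to input/output data can be sketched as follows. This is not the GE team's procedure, which the source does not detail; it assumes a discrete-time form y[k+1] = a*y[k] + b*u[k] sampled every dt seconds, and it uses synthetic data generated from hypothetical values of DC gain and time constant.

```python
import math


def fit_first_order(u, y, dt):
    """Least-squares fit of y[k+1] = a*y[k] + b*u[k].

    Returns (dc_gain, time_constant) of the equivalent first-order system.
    """
    syy = syu = suu = sy1y = sy1u = 0.0
    for k in range(len(y) - 1):
        syy += y[k] * y[k]
        syu += y[k] * u[k]
        suu += u[k] * u[k]
        sy1y += y[k + 1] * y[k]
        sy1u += y[k + 1] * u[k]
    det = syy * suu - syu * syu
    if det == 0.0:
        raise ValueError("data do not determine both parameters")
    # Solve the 2x2 normal equations for a and b.
    a = (sy1y * suu - sy1u * syu) / det
    b = (sy1u * syy - sy1y * syu) / det
    if not 0.0 < a < 1.0:
        raise ValueError("fit is not a stable first-order lag")
    return b / (1.0 - a), -dt / math.log(a)


def simulate_first_order(u, dc_gain, time_constant, dt):
    """Generate the response of a first-order system starting from rest."""
    a = math.exp(-dt / time_constant)
    b = dc_gain * (1.0 - a)
    y = [0.0]
    for uk in u[:-1]:
        y.append(a * y[-1] + b * uk)
    return y


# Synthetic step test with hypothetical plant parameters.
u = [1.0] * 200
y = simulate_first_order(u, dc_gain=2.0, time_constant=5.0, dt=0.1)
gain, tau = fit_first_order(u, y, dt=0.1)
```

On noise-free synthetic data the fit recovers the parameters used to generate it; with measured data the residual of the fit gives a rough indication of whether a first-order model is adequate.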

After considering all of the factors associated with the run-in process, the
researchers made recommendations as to step-size, step-duration, and number of
steps:

•  Step-size 

-  No  less  than  2000  RPM 

-  Avoid  16,000  as  a  step 

-  Best  operating  point:  2500-3500  RPM 

•  Step-duration

-  No  less  than  30  minutes 

-  Monitor long term bearing temperature around 30,000 RPM

-  Best operating point: 30 minutes - 1.5 hours

•  Number  of  steps 

-  Greater  than  5 

-  Less  than  20 

-  Best  operating  point:  7-12  steps 

GE researchers believe that the proposed new unit will improve the run-in
process for their primary spindles (Dubrawski et al. 1998).
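
The recommendations above can be turned into a simple schedule check. The Python sketch below is illustrative only: it assumes equal speed steps from 2,000 to 30,000 RPM over 600 minutes, and it interprets "Avoid 16,000 as a step" as avoiding a 16,000 RPM setpoint, which is an assumption rather than something the source states.

```python
def build_schedule(start_rpm, end_rpm, n_steps, total_minutes):
    """Split the speed range into equal steps.

    Returns (setpoints in RPM, minutes spent at each setpoint).
    """
    size = (end_rpm - start_rpm) / n_steps
    setpoints = [start_rpm + size * (i + 1) for i in range(n_steps)]
    return setpoints, total_minutes / n_steps


def check_schedule(start_rpm, setpoints, step_minutes):
    """Return a list of guideline violations (empty if there are none)."""
    problems = []
    if not 5 < len(setpoints) < 20:
        problems.append("number of steps should be greater than 5 and less than 20")
    if step_minutes < 30:
        problems.append("step duration should be no less than 30 minutes")
    previous = start_rpm
    for sp in setpoints:
        if sp - previous < 2000:
            problems.append(f"step to {sp:.0f} RPM is smaller than 2000 RPM")
        if abs(sp - 16000) < 1:
            problems.append("schedule includes a 16,000 RPM step")
        previous = sp
    return problems


nine_steps, nine_minutes = build_schedule(2000, 30000, 9, 600)
ten_steps, ten_minutes = build_schedule(2000, 30000, 10, 600)
nine_problems = check_schedule(2000, nine_steps, nine_minutes)
ten_problems = check_schedule(2000, ten_steps, ten_minutes)
```

Under these assumptions, any even number of equal steps places a setpoint exactly at 16,000 RPM, the midpoint of the range, so an odd step count such as 9 satisfies all of the listed guidelines while 10 does not.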




CERL  TR  99/88 


5  Reliability  Theory 

Reliability  Theory  Development 

Reliability Theory (RT) was originally developed as a way of describing the
statistical performance of equipment using measures like mean time-to-failure
(MTTF), hazard rate, or system availability. But as RT became more
sophisticated, researchers found that RT excels at describing the
effectiveness of maintenance policies.

The field of RT has grown steadily since the 1930s. Reliability Theory’s growth
has been spurred by increasingly complex applications and increasingly
sophisticated statistical methods. Arguably, RT went through its two largest
growth spurts when applied to vacuum tube based computers in the 1940s and when
applied to semiconductor fabrication plants in the 1990s.

Early computers required thousands of vacuum tubes. By today’s standards,
they computed very slowly. Without optimized maintenance procedures, the
mean time to failure could easily fall below the time required to run a complete
program. This forced the electrical engineers responsible for maintenance to
think carefully about the consequences of their procedures. The optimal
maintenance policies these engineers invented are quite useful for optimizing
maintenance strategy in the Army environment.

During the 1980s and 1990s semiconductor fabrication plants rapidly increased
the density of devices on a single chip. The photolithograph machines used to
produce such fine patterns on silicon wafers were extremely expensive. In
addition, each photolithograph machine had a unique set of optical aberrations
that could not be reproduced exactly in another machine. Silicon chip designs
incorporate several layers of devices that have to appear in exact registration
with other devices in layers both above and below. Therefore, when adding a new
layer, the wafer is brought back to the same photolithograph machine several
times. Plants that are required to route a part to the same machine several
times are said to be “re-entrant.”

The scheduling policies that optimize re-entrant plant performance are still
poorly understood, but several recent advances in queuing theory have yielded
methods that are superior to commonly used heuristic (exploratory self-taught)
methods. The modeling tools that describe re-entrant plants also describe the
maintenance process at the Department of Public Works (DPW) level.

The initial models obtained in RT describe the behavior of a set of machines or
parts that make up an entire system. For many years it was assumed that RT
would advance as these models improved. This assumption was false in several
important respects. From a mathematical standpoint, there are several common
time-to-failure (TTF) distributions that can easily be justified using sound
physics and engineering principles. Unfortunately, these functions are
sufficiently similar to one another that it takes a large number of experiments
or voluminous field data to determine which function is correct. Appendix A
gives an important example of this. Furthermore, there are practical problems
with such models. If we own a motor and have a reasonable model describing its
expected service life, there is little we can do to influence the situation. We
are the motor’s owner, not the manufacturer. Therefore, we only have control
over our own maintenance policy. This is a common situation and has had a
strong influence on RT research.

We will show that the true power of RT lies in the optimization of our own
maintenance policies. Also, we will show that modern methods allow us to
predict diverse effects of our maintenance policies, including environmental
impact, staffing levels, budgets, etc.

The balance of this chapter summarizes the important results of RT. Actual
derivations are included in the appendixes to this report for material that is
not covered well elsewhere. References are given to other publications that
give good expositions.


Application of Reliability Theory

One might ask, “Do we really care about exact failure mechanisms?” In fact, the
answer might well be, “many times, we do not.” At first this may seem strange.
After all, we are attempting to optimize maintenance strategy with respect to
failures of individual components. A simple example illustrates our
counterintuitive answer. Suppose we have a device, for example, an electric
motor. If we conjecture about the likely failure mechanisms, two are readily
apparent. First, the device could fail as a function of how many hours it has
been operated in total. (This mechanism is called “wear out.”) Second, we could
guess that the device might fail as a function of how many times it has been
turned on and off. (An example is a failure due to “thermal cycling.”) We might
expect the TTF distribution for each of these processes to be different, and we
might believe it to be important to know which process dominates the service
life of our device. Surprisingly, the TTF distributions of these two mechanisms
are so similar that they have no significant impact on our predictions.
Appendix A gives a proof of this.

The similarity of these two TTF distributions is a typical situation, which has
several important consequences. First, with little data we can easily come up
with a good guess of what the TTF distribution looks like and make reasonable
guesses about the predicted service life of the device. In this way, we say that
RT methods have “good predictive power.” By the same reasoning, we possess
little information about how to extend the service life of the device. If
thermal cycling dominates the machine’s life, it would be better to run it less
often and for longer periods. On the other hand, if accumulated run time
dominates the machine’s service life, it would be better to run the machine
more often for shorter periods. Because the TTF distributions are so similar,
we cannot prescribe which course to take. For this reason we say that RT has
“poor prescriptive power.” Experiment is the best way to find out which failure
mechanism dominates.

Problems of this type, where there are trade-offs between predictive power and
prescriptive power, and where we are trying to guess the TTF distribution in the
smallest number of experiments, are called problems of “system identification.”

In general, it takes an unacceptably large number of experimental trials to
discriminate between two similar TTF distributions. For more detail see
Wolstenholme (1999).


Hazard  Rate  — A  Reliability  Theory  Measure 

Hazard  Rate  is  defined  as  the  fraction  of  working  devices  that  fail  per  unit  time. 

We shall show that it is one of the most useful measures in classical RT. A
derivation of the following facts can be found in Wolstenholme (1999).

1.  There is a one-to-one mapping between the TTF distribution and the Hazard
Rate. That is, each TTF distribution has a unique Hazard Rate and each Hazard
Rate has a unique TTF distribution.

2.  If the Hazard Rate is either constant or monotonically decreasing, the
system it describes is called a “happy system.” If the Hazard Rate
monotonically increases, the system is called an “unhappy system.”

3.  RT can be used to derive an optimal maintenance schedule for all unhappy
systems.

4.  RT  gives  no  analytical  method  of  optimal  maintenance  for  happy  systems. 

The reason for facts 3 and 4 is simple. The life expectancy of a happy system is
either constant or increases with time, assuming the system is still running.
That is, if we observe that a happy system is still running, our estimate of its
life expectancy is either as long as that of a new machine, or even longer. The
prime example of a happy system is software. Each time a bug is removed from a
piece of software, the next bug will either be just as hard to find as the
previous one or even harder. If a piece of software runs correctly, the best
maintenance policy is to let it continue to run (i.e., “Don’t mess with it”).
That is, any preventative maintenance we perform on the software is more likely
to induce a new bug than to fix an as-yet-undiscovered bug. (The reader may
notice that this is one of the reasons that the so-called Y2K or millennium bug
is such a big problem. Many pieces of old software are so well debugged that
companies use them until they are far beyond obsolescence.)

It  should  be  clear  why  classical  RT  cannot  yield  an  optimal  maintenance  strategy 
for  happy  systems.  If  you  observe  that  your  happy  system  is  running,  the  best 
strategy  is  to  let  it  keep  running! 

An  interesting  sidelight  is  to  notice  that  the  dividing  line  between  happy  and 
unhappy  systems  is  when  the  hazard  rate  is  constant.  This  situation  means  that 
the  TTF  distribution  is  exponential.  This  explains,  in  part,  why  the  exponential 
distribution  is  so  useful  in  RT. 
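
The dividing line between happy and unhappy systems can be illustrated with a Weibull TTF distribution, whose hazard rate is (k/s)(t/s)^(k-1) for shape k and scale s. The Weibull family is used here only as a convenient, well-known example and is an assumption of this sketch, not something the text prescribes; with k = 1 it reduces to the exponential distribution and a constant hazard rate.

```python
def weibull_hazard(t, shape, scale=1.0):
    """Hazard rate of a Weibull TTF distribution at time t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1.0)


def classify(shape, times=(0.5, 1.0, 1.5, 2.0)):
    """Label the system from the trend of its hazard rate at sample times."""
    values = [weibull_hazard(t, shape) for t in times]
    pairs = list(zip(values, values[1:]))
    if all(later <= earlier for earlier, later in pairs):
        return "happy"      # constant or decreasing hazard rate
    if all(later > earlier for earlier, later in pairs):
        return "unhappy"    # increasing hazard rate
    return "neither"


decreasing_case = classify(0.5)   # hazard falls with age
constant_case = classify(1.0)     # exponential TTF distribution
increasing_case = classify(2.0)   # wear-out behavior
```

The shape values 0.5, 1.0, and 2.0 are hypothetical; in practice, as the text notes, deciding which case applies to real equipment can require a large amount of data.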

However, the existence of happy systems poses a puzzle. If we need a large
sample of systems to determine if our system is happy or unhappy, how do we say
anything meaningful about optimal maintenance in the meantime? Arriving at
a meaningful answer to this question represents the dividing line between
classical and modern RT.


Can  We  Cope  with  Happy  Systems? 

Should we abandon RT when we do not know if the system is happy? No. Can
we say anything meaningful about optimal maintenance of happy systems or
about systems in general? Yes. The quickest way to see this is to study the
mathematical definition of system availability. If we compute the fraction of
time that the system is available for work, it is easy to show that:
Availability = MTTF / (MTTF + MTTR)          Eq. 2


where: 

MTTF  =  mean-time-to-failure 
MTTR  =  mean-time-to-repair 

That is, the system’s availability depends on the MTTF which, in many cases, we
cannot influence much because it was implicitly set by the equipment
manufacturer. But the system’s availability also depends on the MTTR, over
which we have complete control.

This is the observation that makes RT useful in all situations, and has the
largest impact on how optimal maintenance systems are structured. In fact, many
modern researchers believe that this observation is so important that they refer
to the resulting maintenance systems as implementing RCM. But drawing a
sharp chronological dividing line between RT and RCM is misleading. It took
many years for researchers in RT to realize that this scenario of reducing MTTR
arose repeatedly.
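
The effect of MTTR on availability in Eq. 2 can be checked with a few lines of Python. The MTTF and MTTR figures below are hypothetical and serve only to show the direction and rough size of the effect.

```python
def availability(mttf, mttr):
    """Fraction of time the system is available: MTTF / (MTTF + MTTR)."""
    if mttf < 0 or mttr < 0 or mttf + mttr == 0:
        raise ValueError("MTTF and MTTR must be non-negative and not both zero")
    return mttf / (mttf + mttr)


MTTF_HOURS = 1000.0  # hypothetical, taken as fixed by the manufacturer

slow_repair = availability(MTTF_HOURS, mttr=50.0)
fast_repair = availability(MTTF_HOURS, mttr=10.0)
```

With the MTTF held fixed, shortening the repair time from 50 hours to 10 hours raises availability from roughly 95 percent to roughly 99 percent in this example.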


Non-Destructive  Evaluation  and  Happy  Systems 

Can a non-destructive evaluation (NDE) work with happy systems? Yes. Recall
the definition of a happy system. When conditioned on the observation that the
system is still running, a system whose life expectancy remains constant or
increases is called a happy system. However, certain methods of NDE may exist
that still accurately predict the demise of the system. That is, conditioned on
the NDE observation, the system’s life expectancy can decrease. To those
unfamiliar with probability, this can be confusing at first. The easiest way to
explain is that the NDE method may yield new information beyond the simple
observation that “the system still runs.” NDE can give early warning of system
failure and allow the system operator to initiate procurement of appropriate
components before they fail. In this way, the MTTR can be reduced, thus
increasing system availability.

The CERL-engineered management system for pavements has been in use at
various municipalities and installations for many years. The PAVER system
(Ginsberg, Shahin, and Walther 1990) has many components, but this exposition
focuses on only two:
1.  There is an adaptive algorithm that predicts pavement degradation as a
function of time. The adaptation has two purposes. First, it can adapt to local
conditions such as pavement constituents, weather, and usage. Second, this
adaptation implicitly gives rise to a modified version of Hazard Rate (for
specialists, we are referring to the rate of change in the pavement condition
index [PCI]). Initially, it would seem that, if PAVER detects a constant or
decreasing hazard rate (constant or decreasing rate of PCI degradation), the
system should recommend that the user discontinue usage of PAVER.

2.  A second component of PAVER allows users to predict their maintenance needs
over an entire road network and, as a side effect, shows them how to plan their
out-year budgeting with minimal variation. This allows maintenance policy to
harmonize well with budget policy. By correctly predicting the amount of money
per year needed to maintain a particular road network and by minimizing the
year-to-year variation in road repair dollars, PAVER would still be considered a
useful tool even in situations where it does not reduce the overall cost of
pavement maintenance.


Reliability  Centered  Maintenance 

RCM recognizes the value of an organization’s personnel and takes advantage of
their extensive experience running the facility/equipment. The following
categories can be used to assist in classifying maintenance of equipment:

•  Corrective Maintenance (CM) or “run-to-failure” works on the assumption
that it is most cost-effective to allow equipment to run unattended until it
fails. Corrective Maintenance is used on the lowest priority equipment.

•  Preventive Maintenance (PM) is based on performing maintenance tasks on
equipment at regular intervals, regardless of whether maintenance is actually
needed at the time.

•  Predictive Maintenance (PDM) is based on real-time data collected from a
piece of equipment. These data show the current status of the equipment.

•  Proactive Maintenance (PAM) determines the root causes of failure. This
involves going to the manufacturer for equipment redesign to avoid future
breakdowns of the equipment (Reliability Centered Maintenance [RCM] -
Tutorial and Application).


Harmonizing  Reliability  Centered  Maintenance  with  Army  Policy 

RCM can, on occasion, reconcile disparate views of maintenance policy. In the
early 1990s, the Army decided on the basis of impending reduction in funding
that the infrastructure part of the budget would be cut rather than the training
budget. The decision was made that the policy of run-to-failure would be the
most cost-effective means of equipment maintenance. In the Army’s accounting
system, maintenance dollars come from a fixed funding line item. It makes no
sense to organize an optimal maintenance program, because the cost of
inspection and early replacement would only deplete maintenance funds. But in
the event of equipment breakage, the Army accounting system produces a minor
miracle. The cost of repairing broken equipment on an emergency basis comes
from the “capital improvements” budget that is separate from the maintenance
budget. By using a strict run-to-failure policy, equipment managers can
maximize the funds used on their equipment.

Unfortunately, it has been known for many years that this run-to-failure policy
is the single fastest way to reduce the MTTF and decrease system availability.
An example of this concept is the maintenance of vacuum tube computers. We will
outline the methods here using this example because it contains several
instructive simplifications that will become apparent shortly.

Consider a computer made up of a large number of vacuum tubes, each of which
has the TTF distribution shown in Figure 5. If we begin using a new computer
containing new vacuum tubes and replace tubes as they fail during the first wear
out cycle, the number of tubes replaced per unit time will look like the first
hump in Figure 6.


Figure  5.  Distribution  of  lamp  life  (Bazovsky  1961). 
Figure 6. Wearout curves of three lamp generations (Bazovsky 1961).

Notice that we enjoy a long period of reliable operation until the tubes begin
to wear out. In the second wear out cycle, the number of tubes replaced per unit
time will look like the second hump of the graph in Figure 6. The third wear out
cycle will look like hump 3 in Figure 6 and so forth. (Note for specialists: The
shape of hump “n” is the convolution of hump n-1 with hump 1.) The number of
tubes replaced per unit time will be the sum of all of these humps and will look
like the upper line of Figure 7. For further details, see Bazovsky (1961).
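
The convolution argument in the note for specialists can be reproduced numerically. The Python sketch below uses a hypothetical discrete tube-life distribution with a mean of 10 time units (not the data behind Figures 5 to 7), builds each hump as the convolution of the previous hump with the first, and sums the humps to obtain the replacement rate.

```python
def convolve(a, b, length):
    """Discrete convolution of a and b, truncated to `length` samples."""
    out = [0.0] * length
    for i, ai in enumerate(a[:length]):
        if ai == 0.0:
            continue
        for j, bj in enumerate(b):
            if i + j >= length:
                break
            out[i + j] += ai * bj
    return out


def replacement_rate(life_pmf, horizon, generations):
    """Sum the wear-out humps of successive tube generations."""
    hump = [life_pmf[t] if t < len(life_pmf) else 0.0 for t in range(horizon)]
    total = list(hump)
    for _ in range(generations - 1):
        # Hump n is the convolution of hump n-1 with hump 1.
        hump = convolve(hump, life_pmf, horizon)
        total = [r + h for r, h in zip(total, hump)]
    return total


# Hypothetical probability that a tube fails at age t (arbitrary time units).
life_pmf = [0.0] * 8 + [0.1, 0.2, 0.4, 0.2, 0.1]
rate = replacement_rate(life_pmf, horizon=300, generations=40)
```

In this example the rate is zero during the initial reliable period, shows a distinct first hump around the mean life, and at late times settles near one replacement per tube per mean life (about 0.1 per time unit here), which corresponds to the stabilized failure frequency the text attributes to Figure 7.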


Figure  7.  Stabilization  of  failure  frequency  (Bazovsky  1961). 
As time goes on, we will be replacing tubes constantly. This means that we have
driven the MTTF down as far as it can go, and have exceptionally low
availability.

An alternative explanation is as follows. This is the reason people buy new
cars! The idea behind buying a new car is that all of the parts of the car are
new at the same time. This is analogous to the long interval between when we
begin running the computer and the first wear out cycle in Figure 6. As a car
gets older, its components are replaced as needed, so they are of widely varying
ages, and it seems like something is always going wrong. This is analogous to
the behavior at large times in Figure 7.

Thus  we  see  that  a  run-to-failure  policy  is  terrible  for  long  term  availability.  Is 
there  an  alternative  way  to  interpret  Army  policy,  maximize  our  maintenance 
budget, and maximize the MTTF? Yes. If we keep track of the mean service life of
the unhappy components in our system, an alternative policy is possible. We wait
until breakdown. This lets us tap into the capital improvements budget. We
then use these funds to fix the broken component and replace all other components
that are beyond or near their expected service life. This is sometimes
called “renewing maintenance” in that it tries to achieve the longest possible
period of high reliability between breakdowns. In this way we try to repeat the
high reliability period before the first wear out peak of Figure 6, again and
again. In the era of the vacuum tube based computers, engineers settled on the
following  maintenance  procedure.  Run  the  machine  until  the  first  vacuum  tube 
blows,  then  replace  all  tubes  in  the  blown  tube’s  sub-system.  This  strategy  was 
found  to  yield  acceptable  MTTF. 
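To make the contrast between the two policies concrete, here is a small Monte Carlo sketch in Python. Everything in it is an assumption made for illustration: a sub-system of 20 tubes, lifetimes drawn uniformly between 5 and 15 time units, instantaneous repairs, and a fixed random seed. It is not the procedure the original engineers used to evaluate their strategy.

```python
import random


def breakdowns_individual(n_tubes, horizon, draw_life):
    """Run to failure: replace only the tube that failed. Returns breakdown count."""
    next_failure = [draw_life() for _ in range(n_tubes)]
    count = 0
    while True:
        i = min(range(n_tubes), key=next_failure.__getitem__)
        if next_failure[i] > horizon:
            return count
        count += 1
        next_failure[i] += draw_life()  # a fresh tube goes into the same socket


def breakdowns_renewing(n_tubes, horizon, draw_life):
    """Renewing maintenance: at each breakdown replace every tube in the sub-system."""
    t, count = 0.0, 0
    while True:
        t += min(draw_life() for _ in range(n_tubes))  # first failure among new tubes
        if t > horizon:
            return count
        count += 1


# Assumed values for illustration only.
N_TUBES, HORIZON = 20, 1000.0
_rng = random.Random(0)


def draw_life():
    return _rng.uniform(5.0, 15.0)


individual = breakdowns_individual(N_TUBES, HORIZON, draw_life)
renewing = breakdowns_renewing(N_TUBES, HORIZON, draw_life)

if __name__ == "__main__":
    print("breakdowns, replace failed tube only:", individual)
    print("breakdowns, replace whole sub-system:", renewing)
```

Under these assumptions the renewing policy produces far fewer breakdowns over the same horizon, but it replaces 20 tubes at each breakdown instead of one, so the gain in reliability is paid for in spare parts.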
6  Queuing  Theory

A recent approach to RCM has come from new methods of queuing analysis.
Queuing theory addresses the analysis of systems where many jobs are waiting
for some service. Appendix B includes a summary of our current understanding
of the behavior of closed queuing networks. It is important to remember that
this is currently an active area of research. The contents of Appendix B are
current, but will be quickly outdated over the next year.

Queuing  theory  analysis  can  yield  important  information  about  the  efficacy  of 
maintenance  procedures.  From  a  queuing  perspective,  maintenance  procedures 
are merely networks of jobs that await service from maintenance personnel.
Although queuing theory has been used in one form or another since the early
1900s, recent advances allow the analysis of much more complex networks. In
the context of RCM, the most important developments of queuing theory can
cope with several complexities common to most maintenance shops.


Re-Entrant  Lines 

In  maintenance  shops,  the  same  job  may  require  service  from  a  single  person 
several different times. For example, personnel responsible for procurement
may  see  a  single  job  several  different  times  in  the  procurement  cycle. 


Multiple  Job  Classes 

Maintenance  personnel  have  many  different  types  of  jobs  waiting  for  them  at 
any one moment. They do not choose the next job randomly. They can distinguish
one job class from another and make a decision consciously.


Service  Policy 

At  any  one  moment,  one  maintenance  worker  may  have  several  different  jobs  of 
different  classes  waiting  for  service.  The  prioritization  the  worker  assigns  to 
these  waiting  jobs  is  called  the  service  policy. 
Stochastic  Routing

In  many  networks,  jobs  can  be  split  up  among  maintenance  personnel. 


Pre-emptive  Scheduling 

If an especially important job arrives for service from a particular worker or set
of  workers,  all  other  work  may  be  suspended  in  order  to  service  the  important 
task. 


Down  Time 

Maintenance workers, machines, or tools can have varying availability based on
their  own  TTF  and  time-to-repair  (TTR)  statistics. 


Large  Networks 

This new style of analysis can cope with enormous and complex queuing networks.
The technique was designed to cope with re-entrant routing in semiconductor
fabrication plants where many products are in process simultaneously.
For specialists requiring a detailed introduction to this style of analysis, we
recommend the article “Closed queuing networks in heavy traffic: Fluid limits and
efficiency” (Kumar and Kumar 1996).
7  Queuing  Theory  Modeling


One of the objectives of the project was to investigate the application of queuing
theory modeling as an operations and maintenance tool for pollution control
equipment. A queuing model usually includes one or more servers that render
the service, a pool of customers, and some description of the arrival and service
processes. If there is more than one queue for a server, then there may also be
some policy regarding which customer receives service. In working with queuing
systems, it is easiest to analyze the system in steady state (after the system has
started up and things have settled down). Our analysis allows us to optimize for
one or more of the following performance measures.


Throughput 

How  fast  do  jobs  go  through  the  maintenance  system?  Can  one  predict  how  long 
a  particular  job  will  take?  Recent  analysis  of  semiconductor  fabrication  plants 
indicates  that  these  new  methods  of  analysis  were  able  to  speed  up  fabrication 
by  30  percent  and  decrease  the  variability  of  completion  times  by  60  percent. 


Queue  Length 

How  many  jobs  are  likely  to  be  waiting  for  any  one  maintenance  worker  at  a 
given  moment?  Can  a  job  be  expedited  without  throwing  off  the  schedule  of 
other  jobs? 
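For the simplest possible case the queue length question has a closed-form answer. The sketch below assumes a single maintenance worker, jobs arriving as a Poisson process, and exponentially distributed service times (the textbook M/M/1 queue). These assumptions, the function name, and the example rates are hypothetical and chosen only for illustration; the networks discussed in this report are considerably more complex.

```python
def mm1_metrics(arrival_rate, service_rate):
    """Steady-state measures of an M/M/1 queue (assumed model, one worker)."""
    if not 0 < arrival_rate < service_rate:
        raise ValueError("steady state requires 0 < arrival_rate < service_rate")
    rho = arrival_rate / service_rate  # fraction of time the worker is busy
    return {
        "utilization": rho,
        "mean_jobs_in_system": rho / (1 - rho),
        "mean_jobs_waiting": rho ** 2 / (1 - rho),
        "mean_time_in_system": 1 / (service_rate - arrival_rate),
        "mean_wait_before_service": rho / (service_rate - arrival_rate),
    }


# Hypothetical rates: one job arrives per day, the worker can finish two per day.
EXAMPLE = mm1_metrics(arrival_rate=1.0, service_rate=2.0)

if __name__ == "__main__":
    for name, value in EXAMPLE.items():
        print(f"{name}: {value:.3f}")
```

The mean number of jobs and the mean time in the system are tied together by Little's law (jobs in system = arrival rate × time in system), which is one reason a reduction in queue length also shows up as faster throughput.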


Failure  and  Repair  Statistics 

Recent results show how to predict the equilibrium points of queuing networks
(Ginsberg and Kumar 1997). This allows prediction of TTF and TTR statistics
for the entire network and, hence, the reliability.
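As a rough illustration of how TTF and TTR statistics translate into availability, the sketch below uses the classical single-repairman "machine repair" model: a closed system of identical machines, each failing at rate 1/MTTF while running, with one worker repairing one machine at a time at rate 1/MTTR. This is an assumed textbook model, not the method of Ginsberg and Kumar (1997), and the function names and numbers are hypothetical.

```python
def machine_repair_steady_state(n_machines, mttf, mttr):
    """Steady-state probabilities p[n] that n machines are down (one repair worker).

    Assumes exponential times to failure (mean mttf) and to repair (mean mttr).
    """
    rho = mttr / mttf
    weights = [1.0]
    for n in range(1, n_machines + 1):
        # Moving from n-1 to n machines down: any of the (N - n + 1) running machines fails.
        weights.append(weights[-1] * (n_machines - n + 1) * rho)
    total = sum(weights)
    return [w / total for w in weights]


def availability(n_machines, mttf, mttr):
    """Long-run fraction of machines that are running."""
    probs = machine_repair_steady_state(n_machines, mttf, mttr)
    running = sum((n_machines - n) * p for n, p in enumerate(probs))
    return running / n_machines


if __name__ == "__main__":
    # Hypothetical numbers: MTTF of 100 hours and MTTR of 5 hours.
    for n in (1, 5, 10, 20):
        print(n, "machines: availability", round(availability(n, 100.0, 5.0), 4))
```

With a single machine the result reduces to the familiar MTTF / (MTTF + MTTR); with more machines sharing one worker, availability falls because failed machines wait in the repair queue.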
Staff  Levels

By calculating “what if” scenarios with software, the number of maintenance
workers and responsibilities can be optimized.


Budgets 

Are budgets made more predictable? Yes. This is not an obvious result. Because
the variability of the output of the system is reduced (see “Throughput” above),
the out-year budget planning is more predictable.


Ecological  Impact  of  RCM

Since  “total  time  to  repair”  (TTR)  statistics  can  be  calculated  for  all  machines, 
systems  and  subsystems,  other  performance  statistics  can  be  deduced  easily.  Of 
particular  interest  to  the  Army  are  the  TTR  statistics  for  pollution  control 
equipment.  The  environmental  impact  of  Army  operations  can  be  quantified  as 
the  average  cost  incurred  given  the  reliability  of  the  system.  In  this  way  the 
maintenance personnel may be able to justify increased expenditures, internal
priority changes, and improved procurement documents because they can predict
the total cost to the Army in advance.
8  Conclusion


This study reviewed innovative methods of performing maintenance of pollution
control equipment, specifically for application at Army installations. Previous
CERL work has shown that RCM usually depends on test regimens rather than
approaching the subject from a statistical viewpoint even though statistics have
been used in manufacturing quite successfully. Statistical maintenance modeling,
even when approached from different initial viewpoints, reveals the impact
of different maintenance policies. Large scale systems have escaped effective
analysis until recently. This study has summarized recent attempts to model
large systems by using a model based on queuing theory. Queuing theory makes
statistical predictions that summarize how long it takes to repair a broken
system, and predicts the availability of the system.

By studying the methods now being used for RCM and the military’s predilection
for run-to-failure maintenance, this research has produced a linear program from
the queuing model that can help reduce the downtime of the equipment, which is
the intent of RCM.

This study concludes that using these recent methods may allow U.S. Army
installations to optimize maintenance policy for minimal ecological impact.
Appendix  A:  Time-to-Failure  Distributions


Equipment that fails randomly over its accumulated run time has an exponential
time-to-failure distribution. This distribution is:

$$f(t) = \lambda e^{-\lambda t}, \qquad t \ge 0.$$


The  quickest  way  to  see  that  this  is  correct  is  to  notice  that: 


$$\int_{-\infty}^{\infty} f(t)\,dt = \int_{0}^{\infty} \lambda e^{-\lambda t}\,dt = 1$$


(i.e., f(t) is an actual distribution), and that:

$$H(t) = \frac{f(t)}{1 - F(t)} = \lambda,$$


i.e., the exponential distribution is unique in having a constant hazard rate. The
probability of failure on or before time t is:

$$F(t) = \int_{0}^{t} \lambda e^{-\lambda s}\,ds = 1 - e^{-\lambda t}.$$


This  is  an  important  result,  as  will  be  shown  shortly. 
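These properties of the exponential distribution are easy to confirm numerically. The short Python sketch below uses an arbitrary assumed rate (lambda = 0.5) and hypothetical function names; it checks that the density integrates to one and that the hazard rate f(t) / (1 - F(t)) does not depend on t.

```python
import math


def pdf(t, lam):
    """Exponential density f(t) = lam * exp(-lam * t) for t >= 0."""
    return lam * math.exp(-lam * t) if t >= 0 else 0.0


def cdf(t, lam):
    """Probability of failure on or before time t."""
    return 1.0 - math.exp(-lam * t) if t >= 0 else 0.0


def hazard(t, lam):
    """Hazard rate f(t) / (1 - F(t))."""
    return pdf(t, lam) / (1.0 - cdf(t, lam))


def integrate(func, a, b, steps=100_000):
    """Midpoint-rule approximation of the integral of func over [a, b]."""
    h = (b - a) / steps
    return h * sum(func(a + (k + 0.5) * h) for k in range(steps))


LAM = 0.5  # assumed failure rate, for illustration only
AREA = integrate(lambda t: pdf(t, LAM), 0.0, 80.0)

if __name__ == "__main__":
    print("area under f:", round(AREA, 6))
    for t in (0.0, 1.0, 5.0, 10.0):
        print("t =", t, "hazard =", round(hazard(t, LAM), 6))
```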

Now consider a system that fails due to on-off cycles, for example, via thermal
cycling. Suppose the probability that the system fails in any one cycle is “p”;
then the probability that it fails on cycle n is:

$$p(n) = (1-p)^{n-1}\,p.$$

That is, the system ran n-1 times and failed on the nth trial. This is the
geometric distribution with parameter p, i.e., the distribution of the number of
Bernoulli trials up to and including the first failure.

There are several ways to observe that this Bernoulli-trial distribution is closely
related to the exponential distribution.
First, consider what we might mean by “a trial” in the Bernoulli-trial distribution.
For an electric motor that undergoes thermal cycling each time power is applied,
this is obvious. Each time power is applied to the motor, we call this a trial.
However, the distinction is not always as clear. Suppose we are meteorologists,
and want to talk about quantities like a “30-year rain.” At first, we may consider
that this means we are talking about a Bernoulli-trial distribution with
p = 1/(30 years) and n = number of years. In this way the chance of observing
such a rain within n years is:

$$\sum_{i=0}^{n-1} (1-p)^{i}\,p = 1 - (1-p)^{n}.$$


This result looks reasonable at first, but on closer inspection, this answer has
several features that are difficult to explain. If n = number of years, this answer
makes no sense from the standpoint of dimensional analysis, where the exponent
should be dimensionless. If there is more than one rain per year, then the number
of trials should not be in units of years, but in units of rainstorms. This is
easily fixed by introducing a constant alpha = rains per year, and the above
result becomes:

$$1 - \left(1 - \frac{p}{\alpha}\right)^{\alpha n}.$$

This lays bare the real objection. This answer should be invariant under the choice
of alpha (but is not). The time units in which we measure the result should not
change the answer. Luckily we can take a limit in alpha:


$$\lim_{\alpha \to \infty} \left[ 1 - \left(1 - \frac{p}{\alpha}\right)^{\alpha n} \right] = 1 - e^{-pn},$$


which is clearly the exponential distribution, associating p with lambda and n
with t.
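The limit can be watched numerically. The sketch below evaluates 1 - (1 - p/alpha)^(alpha n) for the “30-year rain” (p = 1/30 per year, n = 30 years) at increasing values of alpha and compares it with 1 - exp(-p n). The function name and the particular alpha values are assumptions chosen for illustration.

```python
import math


def prob_rain_within(n_years, p_per_year, alpha):
    """Chance of at least one event in n_years with alpha trials per year,
    each trial succeeding with probability p_per_year / alpha."""
    return 1.0 - (1.0 - p_per_year / alpha) ** (alpha * n_years)


P = 1.0 / 30.0  # "30-year rain"
N_YEARS = 30
LIMIT = 1.0 - math.exp(-P * N_YEARS)  # exponential-distribution value
ERRORS = [abs(prob_rain_within(N_YEARS, P, a) - LIMIT) for a in (1, 10, 100, 1000)]

if __name__ == "__main__":
    for a, err in zip((1, 10, 100, 1000), ERRORS):
        print("alpha =", a, "difference from limit =", err)
```

The difference shrinks as alpha grows, which is the sense in which the answer stops depending on the choice of time unit.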

Another way is to calculate the discrete version of hazard rate:

$$\frac{p(n)}{1 - \sum_{i=0}^{n-2} (1-p)^{i}\,p} = \frac{(1-p)^{n-1}\,p}{1 - 1 + (1-p)^{n-1}} = \frac{(1-p)^{n-1}\,p}{(1-p)^{n-1}} = p,$$


which  shows  that  the  discrete  hazard  rate  is  constant. 
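A direct computation makes the same point. The sketch below, with an assumed per-cycle failure probability of 0.1 and hypothetical function names, divides the probability of failing on cycle n by the probability of having survived the first n-1 cycles.

```python
def failure_on_cycle(n, p):
    """Probability that the first failure occurs on cycle n (n = 1, 2, ...)."""
    return (1.0 - p) ** (n - 1) * p


def discrete_hazard(n, p):
    """Probability of failing on cycle n given survival through cycle n-1."""
    survived = 1.0 - sum(failure_on_cycle(k, p) for k in range(1, n))
    return failure_on_cycle(n, p) / survived


P_FAIL = 0.1  # assumed per-cycle failure probability
HAZARDS = [discrete_hazard(n, P_FAIL) for n in range(1, 51)]

if __name__ == "__main__":
    print(min(HAZARDS), max(HAZARDS))
```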

Finally, note that the graphs of these two functions are quite similar (Figure A1).


[Figure A1. Axis label: Probability of failure.]
Appendix  B:  Closed  Queuing  Networks


Recently much progress has been made on the study of closed queuing networks
with a single route, i.e., when the routing matrix is irreducible. Let $T^{u}(N)$
denote the throughput of such a network when the population size is N, and the
scheduling policy u is employed. For Markovian queuing networks with a single
route, Jin, Ou, and Kumar were able to establish upper and lower bounds of the
form:


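$$\frac{N}{N+\nu}\,T \;\le\; T^{u}(N) \;\le\; \frac{N}{N+\gamma}\,T^{*} \qquad \text{for all } N.$$

Above, T* is the throughput capacity of the system. Clearly $\gamma$ is a lower bound
on the asymptotic loss, which is defined as:

$$\lim_{N \to \infty} \frac{N \left( T^{*} - T^{u}(N) \right)}{T^{*}}.$$

When the quantity T in the lower bound is also equal to T*, then it follows that:

$$\lim_{N \to \infty} T^{u}(N) = T^{*},$$
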


and the network operated under the scheduling policy is said to be efficient.
Moreover, in this circumstance, $\nu$ is an upper bound on the asymptotic loss. In
Jin, Ou, and Kumar, linear programs are provided for determining the quantities
$\gamma$, $\nu$, T, and T*. The advantage of having throughput bounds as above is that
they are functional bounds, i.e., they are valid for all N. Hence the determination
of $\gamma$, $\nu$, T, and T* serves to bound the performance curve of the network under
the given scheduling policy.
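As a concrete instance of a throughput curve of this functional form, consider an assumed textbook case that is not taken from Jin, Ou, and Kumar: a closed cyclic network of M identical single-server exponential stations served first-come-first-served. For that product-form network the throughput is known to be T(N) = T* N / (N + M - 1), so, taking the asymptotic loss here as the limit of N (T* - T(N)) / T*, the loss tends to M - 1. The Python sketch below computes the throughput by mean value analysis and compares it with that expression; the function name and the numbers are hypothetical.

```python
def mva_throughput(service_means, population):
    """Throughput of a closed cyclic network of single-server exponential stations,
    computed by exact mean value analysis (one visit to each station per cycle)."""
    queue = [0.0] * len(service_means)  # mean queue lengths with n - 1 customers
    throughput = 0.0
    for n in range(1, population + 1):
        # An arriving customer sees the network as it would be with n - 1 customers.
        response = [s * (1.0 + q) for s, q in zip(service_means, queue)]
        throughput = n / sum(response)
        queue = [throughput * r for r in response]
    return throughput


# Assumed example: M = 4 identical stations, mean service time 0.5, so T* = 2.
M_STATIONS = 4
MEAN_SERVICE = 0.5
T_STAR = 1.0 / MEAN_SERVICE


def closed_form(n):
    return T_STAR * n / (n + M_STATIONS - 1)


def asymptotic_loss_estimate(n):
    t = mva_throughput([MEAN_SERVICE] * M_STATIONS, n)
    return n * (T_STAR - t) / T_STAR


if __name__ == "__main__":
    for n in (1, 5, 20, 100):
        t = mva_throughput([MEAN_SERVICE] * M_STATIONS, n)
        print(n, round(t, 4), round(closed_form(n), 4))
    print("loss estimate at N = 10000:", round(asymptotic_loss_estimate(10000), 4))
```

In this special case the upper and lower functional bounds coincide with the exact curve. For general networks and policies the linear programs give bounds that need not be tight.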

In Kumar and Kumar a different approach was taken. There the fluid model
approach pioneered by Rybko and Stolyar (1991) and Dai (1995) was extended to
the study of closed networks. It was established that the entire class of Last
Buffer First Serve (LBFS) scheduling policies is efficient for the case of a
deterministic closed route, i.e., closed re-entrant lines. These are scheduling
policies based on buffer priorities, where some arbitrary buffer is designated as
the “last” buffer, and priority is given to buffers that are closer to the end. Thus
the stability of LBFS for open systems was extended to closed systems.
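One possible reading of the LBFS rule described above can be sketched in a few lines of Python. The buffer numbering (0 to B-1 in route order around the closed loop), the function name, and the example queue lengths are assumptions made for illustration, not definitions from Kumar and Kumar.

```python
def lbfs_choice(station_buffers, queue_lengths, last_buffer):
    """Pick the buffer a station should serve next under an LBFS-style rule.

    Buffers are numbered 0..B-1 in route order around the closed loop.
    Among the station's non-empty buffers, choose the one with the fewest
    steps remaining before the designated last buffer is reached.
    Returns None when the station has nothing to serve.
    """
    n_buffers = len(queue_lengths)
    waiting = [b for b in station_buffers if queue_lengths[b] > 0]
    if not waiting:
        return None
    return min(waiting, key=lambda b: (last_buffer - b) % n_buffers)


if __name__ == "__main__":
    queues = [2, 0, 1, 4, 0, 3]  # hypothetical jobs waiting in buffers 0..5
    print(lbfs_choice([0, 3], queues, last_buffer=5))  # station serving buffers 0 and 3
    print(lbfs_choice([0, 3], queues, last_buffer=2))  # same station, different "last" buffer
```

Changing which buffer is designated as “last” changes the priority order, which is why the text speaks of an entire class of LBFS policies.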

Earlier, using a formal approach based on Brownian networks, Harrison and
Wein (1990) had conducted an analysis of closed networks with two stations, and
conjectured that a specific policy, called the HW-policy, was asymptotically
optimal. By this is meant that its asymptotic loss is less than that of all other
policies. This was done by examining a particular RBM process, and conjecturing a
formula for the asymptotic loss of all buffer priority policies. The particular HW
policy was characterized by certain indices for buffers, called the HW-indices,
which are used to prioritize the buffers. In Kumar and Kumar (1994) it was also
proved that the HW policy is indeed efficient for all two-station closed re-entrant
lines.

The HW policy and the conjectured asymptotic loss formula were examined in
Jin, Ou, and Kumar. There it was established that its value for T was indeed
equal to T*, thus establishing its efficiency. Moreover, it was established that no
policy could have an asymptotic loss strictly smaller than the conjectured loss of
the HW policy. Simultaneously, an additional condition was identified in Jin,
Ou, and Kumar that is missing in the work of Harrison and Wein (1990). It
was established that under this condition, all non-idling policies are indeed
efficient, and moreover no such policy could have an asymptotic loss strictly greater
than that of the exact opposite of the HW policy, dubbed there the Anti-HW
policy.

Earlier, it was already established in Harrison and Nguyen (1958) that the
closed version of the system in Kumar and Seidman (1990) and Lu and Kumar
(1991) was indeed inefficient, thus establishing that not all policies can have
finite asymptotic loss, and thus also that the conjectured formula for asymptotic
loss in Harrison and Wein (1990) cannot hold in full generality.

More  recently,  Morrison  and  Kumar  (1996)  have  turned  to  examining  the  issue  of 
necessary  conditions  for  efficiency  of  all  non-idling  scheduling  policies  in  closed 
re-entrant  lines.  They  have  established  that  the  condition  earlier  identified  by 
Jin,  Ou,  and  Kumar  as  being  sufficient  for  efficiency  is  actually  necessary  too, 
when  the  inequality  is  allowed  to  be  non-strict. 

All these studies, however, have been confined to closed networks with just one
loop. In many application areas, however, several closed loops may simultaneously
exist. One example is communication networks where several origin-destination
pairs are each controlled by window based flow control policies. A
widespread example of this is TCP/IP, which regulates the window size dynamically
so as not to overload the network. Another example is manufacturing systems
where several part types are made. If the number of parts of each type is
regulated at a fixed level, then again one obtains closed networks with multiple
routes.


The purpose of the present paper is the study of such closed systems with multiple
routes. In such systems there is a population vector made of one population
level for each route. Also one has a vector throughput where each component is
the throughput of a particular route. Our goal is to study the behavior of the
vector throughput as a function of the vector population, both for particular
buffer priority policies as well as the class of all non-idling policies.


Consider a network with B stations labeled $b_1, \ldots, b_B$. There are L routes (or loops)
through the system. The routes are specified by a routing matrix $p_{ij}$, which is the
probability of moving next to buffer j, after having visited buffer i. Buffer i is
served by station $\sigma(i)$. Customers in buffer $b_i$ require an exponentially distributed
service time with mean $1/\mu_i$.

Let $T^{u,r}(N)$ denote the throughput of route r under a given scheduling policy u
when the population vector is N. Here $N = (N_1, \ldots, N_L)$, where $N_l$ is the
population of route l. Also let $\nu = (\nu_1, \nu_2, \ldots)$ denote the asymptotic loss,
when the population proportions in the routes are held constant, but the total
population increases to infinity. The main results are the following.


Theorem  1:  Bound  on  weighted  throughput. 


Let $\alpha_l$ be the fraction of the total population stored in loop l of a closed loop
system. Let $\pi$ be the concatenation of the steady-state probability vectors for
communicating classes of the routing matrix P, i.e., of the various loops so that
;r,.  =  1 .  Consider  first  the  following  linear  program  with  decision  variables 

{T\  q,,  w„  u„  O: 


max 


\  L 


subject  to: 
Denote by T* the value of this LP. Consider now a second linear program with
the same set of decision variables:

min(v) 


subject  to: 
/eL


Denote the value of this LP by $\gamma$.

Then:

$$T(N) \le \frac{N}{N+\gamma}\,T^{*}.$$


Theorem  2:  Lower  functional  bound  on  weighted  throughput.


Consider  first  the  linear  program: 


r 


max 


Vi  y 


subject  to; 


,  V  Uj>  //,. 


pi'kp(j) 


-<ljj  +9y,/+l 


lU)  \ 
^  ft


-<ljj  +9j,M  - 


fi 


L{J)  \ 
Denote by T the value of the LP, and consider the second linear program:

min(v)

subject to:


i-j 


a(i)=(7{j) 
T>Y,cc‘-T^  .


Let $\nu$ denote the value of the LP.
Then:

$$T(N) \ge \frac{N}{N+\nu}\,T.$$


These bounds can be used to study the behavior of loop interaction, priority
interactions, and population level interactions in closed queuing networks with
multiple routes. Two examples are provided below to illustrate the
technique as well as illuminate some interesting phenomena in multi-loop
systems.
Example  1.

Consider the system shown in Figure B1, with the means of the service times Ji
=  {1/3,  2/7,  3/11, 1,  3,  2},  and  giving  priority  to  buffers  bj,  bg  and  bg.  Designate  the 
loop  containing  buffers  bj,  b^  and  bg  as  loop  1  or  “the  slower  loop”  due  to  its  serv¬ 
ice  rates.  Using  the  theorems  above,  the  infinite  population  throughput  extrema 
are found to be the same (T* = T) independent of the choice of loop population
fraction ($\alpha$) and weighting vector (c) (see Figure B2). Further, the throughput
is invariant with changes in loop fraction population ($\alpha$). However, plotting the
asymptotic loss does show a variation with loop fraction population ($\alpha$). These
can be seen in Figure B3 showing $\gamma$, and Figure B4 showing $\nu$.

Clearly,  with  buffers  bj,  bg  and  bg  having  priority,  the  system  favors  the  slow  loop. 
As  we  increase  the  total  population,  we  expect  that  the  system  will  slow  down  as 
the  fast  loop  stagnates  for  lack  of  service  time.  Figures  B3  and  B4  indicate  that 
this  transition  takes  place  more  slowly  with  increasing  total  population  when  the 
population  distribution  favors  the  faster  loop. 

Example  2. 

Retaining  the  same  system  configuration  from  Figure  B2,  consider  now  the  mean 
service  times  Ji  ={1/3,  2,  3/11,  1,  3,  2/7},  and  giving  priority  to  buffers  bj,  bg,  and 
bg. Figure B5 shows that T* ≠ T for every value of loop population fraction ($\alpha$)
and weighting factor (c). This is not a trivial observation. Figure B6 shows the
summary of the throughputs obtained from one long simulation conducted at
each of a large number of population levels, with population fractions fixed at
($\alpha$ = {1/2, 1/2}). An explanation for this intriguing graph may be the presence of
two basins of attraction in the Markov state space. The communication rate
between the two basins of attraction is extremely small so that once the system is
captured by one of the two attractors, it stays there for the duration of the
simulation. This result also demonstrates that the practice of “mixing” or taking
an average of simulation results can be quite misleading in queuing applications.
Finally, the result indicates that flow control schemes based on window size may
succeed or fail based on the transient characteristics of the network and not
steady-state behavior.


[Figure: Throughput Max and Min.]
Figure B3. Lower bound asymptotic loss as a function of
loop fraction population and weighting vector (Ginsberg).


Figure  B4.  Upper  bound  asymptotic  loss  as  a  function  of 
loop  fraction  population  and  weighting  vector  (Ginsberg). 
Figure B5. Theoretical bounds and simulation of throughput as
a function of total population (Ginsberg).
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